FINAL REPORT

Computerized  Adaptive  Measurement  of  Achievement  and  Ability 


Objectives 

This research program was designed to investigate the applications of item
response theory (IRT) and computerized adaptive testing to the unique problems
of the measurement of ability and the measurement of achievement. Specific
objectives relevant to these two areas were as follows:


Adaptive  Achievement  Testing 


1. To study the relative efficiency of various approaches to intersubtest
branching in achievement test batteries.

2. To investigate the dimensionality of measured achievement over time.

3. To study the applicability of IRT models to the problem of mastery testing
and to compare models for adaptive mastery testing with other approaches to
the improvement of mastery decisions and/or reduction in test length in
mastery testing.

4. To explicate the concept of Adaptive Self-Referenced Testing and to examine
its applicability to the achievement testing problem.


Adaptive Ability Testing


5. To evaluate the performance of adaptive testing strategies under conditions
which more reasonably represent the conditions under which these strategies
might be used, and to examine the performance of adaptive testing strategies
in live testing.

6. To evaluate the utility for adaptive testing of response modes and test item
formats usable in adaptive ability testing.

Research  in  pursuance  of  these  objectives  began  in  February  1979  and  continued 
through  April  1983. 


Approach 

The research utilized a combination of monte carlo simulation studies and
live-testing studies.

Adaptive  Achievement  Testing 

Intersubtest branching. Intersubtest branching is an approach to the
utilization of adaptive testing methodologies in a multidimensional item pool.
In intersubtest branching, IRT item parameters are estimated separately for
each subtest of a multisubtest battery. Using any of a number of adaptive
testing strategies, adaptive testing occurs within the subtest based on
appropriate item selection rules and a test termination criterion appropriate
for the purpose of testing. Upon completion of a subtest in the test battery,
the final trait level estimate (θ̂) is then used as an entry point to begin
testing in a subsequent subtest in the battery. As originally proposed,
subtests in a battery are ordered by the magnitudes of the squared multiple
correlations of each subtest with all other subtests in the battery. In this
way, the entry points for adaptive testing in each subtest utilize the
information available in the tests in the test battery that were most highly
correlated with it, which should shorten the adaptive tests for later subtests
in the battery as much as possible.
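
The subtest-ordering step can be illustrated with a short sketch. It assumes
that the subtest intercorrelation matrix R is available, uses the standard
identity SMC_i = 1 - 1/(R^-1)_ii for the squared multiple correlation of
subtest i with all other subtests, and assumes that subtests are taken in
descending order of that value. The function names are hypothetical and are
not taken from the project software.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("correlation matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def squared_multiple_correlations(corr):
    """SMC of each subtest with all others: 1 - 1 / diag(R^-1)."""
    inverse = invert(corr)
    return [1.0 - 1.0 / inverse[i][i] for i in range(len(corr))]


def order_subtests(names, corr):
    """Return (name, SMC) pairs sorted by SMC, largest first (an assumption)."""
    smc = squared_multiple_correlations(corr)
    return sorted(zip(names, smc), key=lambda pair: pair[1], reverse=True)


# Illustrative correlations only; these are not data from the reports.
names = ["C", "B", "A"]
corr = [[1.0, 0.2, 0.3],
        [0.2, 1.0, 0.6],
        [0.3, 0.6, 1.0]]
for name, value in order_subtests(names, corr):
    print(name, round(value, 3))
```

With only two subtests the SMC reduces to the squared correlation between
them, which gives a quick check on the computation.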

Intersubtest adaptive branching was studied by real-data simulation in
Research Report 79-6, and by monte carlo simulation in Research Report 81-4.
The study reported in Research Report 79-6 used data from
conventionally-administered tests which were analyzed as if they had been
administered as an adaptive test, and the intersubtest branching strategy was
applied to these data. This study was designed to separate the effects of the
adaptive intrasubtest item selection procedure from those effects due to
intersubtest branching. The study also (1) allowed evaluation of the effects
of different intrasubtest termination criteria, (2) investigated the effect of
taking into account errors of measurement in the multiple regression procedure
used to determine test entry points, and (3) investigated the stability of the
regression equations in cross-validation.


Other aspects of the intersubtest branching strategy when applied to an
achievement test battery were investigated by monte carlo simulation in
Research Report 81-4. Questions of interest in this study included (1) the
effects of varying subtest order, (2) the utilization of different subtest
termination criteria, and (3) the effect of variable versus fixed entry on the
psychometric properties of the intersubtest branching strategy. Dependent
variables included (1) reductions in test length, (2) effect on test
information, and (3) correlations between achievement estimates and true
achievement levels. The study design also permitted separation of the effects
of intrasubtest and intersubtest adaptive branching.

The dimensionality of measured achievement over time. The effects of
instruction on measured achievement are usually measured at a single point in
time. That is, some instruction is given to an individual and at the end of the
period of instruction an achievement test is used to determine whether the
individual has reached an appropriate level of achievement. On the basis of
such information, aggregated across individuals, decisions are frequently made
about the adequacy of instructional programs, or about the impact (or lack
thereof) of instruction on a specific individual.

A  more  powerful  approach  to  the  measurement  of  achievement  would  involve 
the  use  of  pretests  and  posttests  to  determine  if  any  change  has  occurred  in 
measured  achievement  over  time.  Using  change  scores,  however,  implies  that  the 
variable  being  measured  is  the  same  at  pretest  as  it  is  at  posttest.  There  has 
been  very  little  empirical  data  available  concerning  this  issue. 

Research Report 81-5 was designed to investigate the question of whether
the achievement factor identified at pretest in an achievement test is the same
factor identified at posttest. Two studies used achievement test data from
groups of college students in mathematics classes and biology classes.
Achievement test item responses were factor analyzed prior to instruction, and
again at the end of instruction. In addition, mean differences in test scores
at pretest and posttest were analyzed. Factors obtained at pretest were
compared with those obtained at posttest to determine if the same factor was
found prior to and after instruction.

Kingsbury (1984) directly examined the characteristics of change scores
derived from adaptive and conventional tests. This study utilized data from
college-level biology examinations. Both adaptive and conventional tests were
administered in a complex design to groups of students in such a way that
reliabilities of the change scores could be determined separately for the two
types of tests in a number of different homogeneous content areas covered in
the course. The study tested the hypothesis that the more precise achievement
level estimates resulting from adaptive testing should also result in more
reliable change scores in comparison to those from conventional testing. Also
studied was the effect of variable- versus fixed-length test termination on
the adaptive tests.

Adaptive Mastery Testing. Adaptive mastery testing (AMT) combines IRT and
adaptive testing into an efficient strategy for making mastery or classification
decisions. In this procedure, items used to make a mastery decision are
selected by an IRT maximum information adaptive testing strategy. Item
responses are scored using a Bayesian θ estimation procedure, and a confidence
or credibility interval is computed for the θ estimate. The confidence interval
around the estimate is then compared to a mastery cutoff score, which is also
expressed on the θ metric. A mastery decision is determined on the basis of
whether the credibility interval overlaps with the mastery criterion level, and
on which side of the mastery cutoff score the individual's θ estimate falls.
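
The decision rule just described can be sketched as follows. This is an
illustration under stated assumptions, not the project's implementation: the
interval is taken to be symmetric and normal (θ̂ plus or minus z posterior
standard deviations), z = 1.96 is an arbitrary choice, and the "continue"
outcome and the forced decision at pool exhaustion are assumed conventions.
All names are hypothetical.

```python
def amt_decision(theta_hat, posterior_sd, cutoff, z=1.96):
    """Classify an examinee from a credibility interval around theta_hat.

    Returns "master" or "nonmaster" when the interval lies entirely on one
    side of the cutoff, and "continue" when the interval still overlaps it.
    """
    lower = theta_hat - z * posterior_sd
    upper = theta_hat + z * posterior_sd
    if lower > cutoff:
        return "master"
    if upper < cutoff:
        return "nonmaster"
    return "continue"


def forced_decision(theta_hat, cutoff):
    """Fallback when no items remain: use the side of the point estimate."""
    return "master" if theta_hat >= cutoff else "nonmaster"


# Illustrative values only.
print(amt_decision(1.0, 0.3, cutoff=0.0))
print(amt_decision(0.2, 0.3, cutoff=0.0))
```

Because the posterior standard deviation shrinks as items are administered,
examinees far from the cutoff are classified after few items, while examinees
near the cutoff require more.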

Both monte carlo simulation and live testing were used to investigate
characteristics of the AMT strategy and to compare it with other approaches for
making mastery decisions. In Research Report 80-4 (also Kingsbury & Weiss,
1980a) the AMT procedure was compared to a conventionally-based mastery testing
procedure and to a procedure based on Wald's sequential probability ratio test.
The procedures were compared in terms of their efficiency, based on the test
length required by the procedures to make a classification decision, on the
validity of the decisions made by each procedure, and on the type of
classifications made by each of the three testing procedures.
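
For readers unfamiliar with the sequential comparison procedure, the following
is a minimal sketch of Wald's sequential probability ratio test applied to a
mastery decision. It is a textbook binomial form, not necessarily the variant
used in Research Report 80-4. The proportion-correct values defining the
nonmaster and master hypotheses, and the error rates alpha and beta, are
assumed for illustration.

```python
import math


def sprt_mastery(responses, p_nonmaster=0.5, p_master=0.8,
                 alpha=0.05, beta=0.05):
    """Binomial SPRT over a sequence of 0/1 item scores.

    Returns (decision, items_used). Testing stops as soon as the cumulative
    log likelihood ratio crosses either of Wald's boundaries.
    """
    upper = math.log((1 - beta) / alpha)   # accept "master" at or above this
    lower = math.log(beta / (1 - alpha))   # accept "nonmaster" at or below this
    llr = 0.0
    for n, correct in enumerate(responses, start=1):
        if correct:
            llr += math.log(p_master / p_nonmaster)
        else:
            llr += math.log((1 - p_master) / (1 - p_nonmaster))
        if llr >= upper:
            return "master", n
        if llr <= lower:
            return "nonmaster", n
    return "undecided", len(responses)


print(sprt_mastery([1, 1, 1, 1, 1, 1, 1, 1, 1, 1]))
print(sprt_mastery([0, 0, 0, 0, 0, 0]))
```

Unlike AMT, this form treats all items as exchangeable, so it does not use
item difficulty or discrimination when accumulating evidence.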

To examine the generality of the findings in live testing, in Research
Report 81-3 the AMT procedure and a conventional test were administered to
students in a biology class. In contrast to earlier studies of the AMT
procedure, actual adaptive mastery tests were administered to one subgroup of
students while the other received computer-administered conventional tests. The
performance of the two testing strategies was evaluated in terms of a mastery
criterion based on the students' final standing in the course, which was a
combination of their performance on course examinations and laboratory grades.

Adaptive Self-Referenced Testing. Adaptive self-referenced testing (ASRT)
is a combination of IRT and adaptive testing designed to permit the efficient
measurement of changes in achievement levels due to exposure to instruction.
This procedure is designed to measure individual changes in achievement in a
unidimensional item pool in a very efficient manner at a number of points of
instruction. It is thus an appropriate conceptualization for tracking
individual changes due to instruction at a number of points during a course,
since it permits an instructor to evaluate an individual's performance on a
minimum number of items at each of a number of testing occasions.

ASRT permits an instructor to measure a student early in a course, such as
on the first day, and as frequently as is necessary during the course. Based on
adaptive testing methodology, the data obtained from the Time 1 testing are used
as the entry point to Time 2 adaptive test administration, and this process is
followed for any number of test administrations. In addition, test termination
at any point in time can be based on the standard error band associated with an
individual's θ estimate at that point in time. ASRT is designed to
simultaneously permit intraindividual measurement of change, norm-based
measurement on the θ metric which can then be converted to the
proportion-correct metric if desired, and a mastery-based
(criterion-referenced) achievement level estimate utilizing the procedures of
AMT. While no research directly related to ASRT was done during the contract
period, the method was described in some detail in Weiss & Kingsbury (1984).
Both Research Report 81-5 and the Kingsbury (1984) study have implications for
the use of ASRT and its future development.

Adaptive  Ability  Testing 

Adaptive testing strategies. A major focus of this research program was on
the evaluation of different approaches to computerized adaptive testing. While
earlier projects were concerned primarily with evaluating the relative
performance of adaptive and conventional testing strategies, in this project
the focus was on the IRT-based strategies and on their performance under a
variety of conditions. An overview of some aspects of project research is given
by Weiss (1982).

The performance of a Bayesian adaptive testing strategy was reported in
Research Report 83-2 (also Weiss & McBride, 1984). Owen's Bayesian adaptive
testing strategy was examined in three studies which utilized an accurate prior
θ estimate, a constant prior θ estimate with fixed test length, and a constant
prior θ estimate with variable test length. The performance of the adaptive
testing strategy was examined in terms of the bias and information of the θ
estimates as a function of θ. Also examined was the mean number of items
administered in the variable test length condition.

A major concern of the research was to evaluate the performance of adaptive
testing strategies under increasingly realistic conditions. Prior to these
studies, all studies evaluating the performance of adaptive testing strategies
did so under fairly unrealistic conditions. While characteristics of the
item pools varied in these earlier studies, the IRT item parameters used in
these simulation studies were considered to be accurate. However, in real item
pools, there is always some error associated with the item parameter estimates.
Since adaptive testing is designed to select items on the basis of these item
parameter estimates, it was reasonable to expect that inaccuracy in the item
parameter estimates would have detrimental effects on the performance of
adaptive testing strategies.

Consequently, two studies were designed to investigate effects of errors in
item parameter estimates on the performance of maximum information and Bayesian
adaptive testing strategies. The first study (Crichton, 1981) assessed the
effects of errors in item parameter estimates in the context of the
three-parameter logistic model. Crichton compared the performance of the two
IRT-based adaptive testing strategies (maximum information and Bayesian) with
the stratified adaptive (stradaptive) strategy, on the hypothesis that the
stradaptive strategy should be less sensitive to errors in the item parameter
estimates.
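
To make concrete why item parameter error matters for item selection, the
sketch below shows maximum information selection under the three-parameter
logistic model. It uses the standard 3PL response function and item
information function, with the scaling constant D = 1.7 assumed. The item pool
and function names are hypothetical. Because selection depends only on the
estimated a, b, and c values, any error in those estimates changes which item
appears most informative at the current θ estimate.

```python
import math

D = 1.7  # logistic scaling constant (assumed; D = 1.0 is also common)


def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3-parameter logistic model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_information(theta, a, b, c):
    """Fisher information of a 3PL item at trait level theta."""
    p = p_3pl(theta, a, b, c)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def select_item(theta_hat, pool, administered):
    """Index of the unused item with maximum information at theta_hat.

    pool is a list of (a, b, c) tuples; administered is a set of indices.
    """
    candidates = [i for i in range(len(pool)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta_hat, *pool[i]))


# Hypothetical pool of (a, b, c) triples.
pool = [(1.0, -2.0, 0.2), (1.0, 0.0, 0.2), (1.0, 2.0, 0.2)]
first = select_item(0.0, pool, administered=set())
second = select_item(0.0, pool, administered={first})
print(first, second)
```

In a simulation of parameter error, the pool passed to select_item would hold
the error-laden estimates, while responses would be generated from the true
parameters.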

Her monte carlo simulation study varied test length from 5 to 30 items. Test
length was then crossed with three levels of error in the discrimination (a)
parameter, four levels of error in estimates for the difficulty (b) parameter,
and two levels of error in the pseudo-guessing (c) parameter. In addition to
considering these effects for a, b, and c separately, two datasets examined the
effects of joint errors in the a, b, and c parameters. Dependent variables
conditional on θ included the bias, root mean square error, inaccuracy, and
information in the θ estimates, and the correlation of θ and θ̂.

Mattson (1983) also examined the performance of adaptive testing strategies
under conditions of error in item parameter estimates, using monte carlo
simulation. Mattson extended the Crichton study by studying similar effects in
the one- and two-parameter logistic models, in addition to the three-parameter
model. Whereas Crichton limited her trait level estimation to maximum
likelihood scoring of the response vectors, Mattson also included Bayesian
scoring of the maximum information and Bayesian adaptive tests. In addition,
Mattson allowed the level of correlation between the a and b parameters to vary
at four different levels, as well as examining the uncorrelated condition used
by Crichton. Similar to Crichton, Mattson also varied test length from 10 to 30
items. Finally, Mattson allowed errors in the a parameter to vary at two
levels, examined four levels of error in b, and one level of error in c. All
conditions were crossed with each other. Mattson's dependent variables were the
same as those studied by Crichton.

A second factor that can affect the performance of adaptive testing
strategies in a realistic item pool is the dimensionality of the item pool.
Since the IRT models underlying these strategies assume a unidimensional item
pool, deviations from unidimensionality would be expected to affect the
performance of adaptive testing strategies in real item pools, which are rarely
(if ever) strictly unidimensional. As a result, Suhadolnik and Weiss (1985)
examined the robustness of adaptive testing to multidimensionality.

In this study, the maximum information adaptive testing strategy using
maximum likelihood scoring was applied to datasets varying from strictly
unidimensional to four-factor datasets that reflected the structure of the most
multidimensional subtest of the Armed Services Vocational Aptitude Battery.
Between these extremes were two- and three-factor datasets in which the second
and third factors accounted for varying proportions of variance in comparison
to the first factor, thus simulating item structures varying from very little
multidimensionality to a very high degree of multidimensionality. A total of 45
data structures was examined.

To evaluate the effects of multidimensionality, dichotomous item responses
were simulated from the specified multidimensional structures. These item
responses were then treated as if they were derived from a unidimensional model,
and adaptive testing was implemented using the item response vectors. To
evaluate the performance of the maximum information adaptive testing strategy
under multidimensionality, the conditional bias, inaccuracy, and root mean
square error of the θ estimates were computed relative to the true first factor
θ from the multidimensional structure.
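
The conditional indices named above can be computed as in the following
sketch, which summarizes a set of θ estimates obtained from replications at a
single true θ level. Bias and root mean square error follow their usual
definitions. "Inaccuracy" is assumed here to mean the mean absolute error,
since the report does not define it. The function name is hypothetical.

```python
import math


def conditional_indices(true_theta, estimates):
    """Bias, inaccuracy, and RMSE of theta estimates at one true theta level."""
    n = len(estimates)
    if n == 0:
        raise ValueError("at least one estimate is required")
    errors = [estimate - true_theta for estimate in estimates]
    bias = sum(errors) / n                              # mean signed error
    inaccuracy = sum(abs(e) for e in errors) / n        # assumed: mean |error|
    rmse = math.sqrt(sum(e * e for e in errors) / n)    # root mean square error
    return {"bias": bias, "inaccuracy": inaccuracy, "rmse": rmse}


# Illustrative estimates only.
print(conditional_indices(0.0, [-0.5, 0.5]))
```

Repeating the computation over a grid of true θ values gives each index as a
function of θ, which is how conditional results are usually plotted.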

Response modes, test item formats, and effects of test administration
variables. The administration of ability tests by interactive computers allows
the use of item types that transcend the typical dichotomously-scored
multiple-choice test item. Research Report 83-3 examined aspects of a
probabilistic response mode used in conjunction with the typical
multiple-choice item format. This response mode was chosen as one means of
extracting additional information from a multiple-choice item, rather than
simply requiring a choice of a single response alternative.

A major problem with probabilistic responding to multiple-choice items in
conventional paper-and-pencil test administration is that examinees do not
always follow the instructions carefully, so that the probabilities they assign
to the item responses do not always sum to 1.00. As a consequence, large
amounts of data might be lost for a given examinee. When multiple-choice items
are answered in a probabilistic mode on a computer terminal, however, the
validity of the distribution of the probabilities can be checked immediately
for each individual's responses to each test item, and invalid responses can be
adjusted until they meet the appropriate criteria.
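
The immediate validity check described above amounts to a simple test applied
to each response before it is accepted. The sketch below is illustrative only:
the tolerance for rounding and the rule that each probability must lie between
0 and 1 are assumptions, and the function name is hypothetical.

```python
def validate_probabilities(probs, tolerance=0.005):
    """Return True if probs is a usable probability assignment for one item.

    Each value must lie in [0, 1] and the values must sum to 1.00 within the
    given tolerance (an assumed allowance for rounding in keyed-in values).
    """
    if not probs:
        return False
    if any(p < 0.0 or p > 1.0 for p in probs):
        return False
    return abs(sum(probs) - 1.0) <= tolerance


# A four-alternative item: the first response is accepted, the second would
# be returned to the examinee for adjustment.
print(validate_probabilities([0.5, 0.3, 0.2, 0.0]))
print(validate_probabilities([0.5, 0.5, 0.5, 0.0]))
```

In an interactive administration, the item would be presented again with the
examinee's entries until this check passes, so that no responses are lost.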

The utility of the probabilistic response mode was examined first by
comparing the usefulness of different scoring formulas associated with the
response mode. Then, the factor structure resulting from the probabilistic
response mode was studied in comparison to the factor structure obtained from
scoring the responses dichotomously. Also examined in Research Report 83-3 were
the validities of the scores obtained from the different scoring methods, their
reliabilities, and the effects of certainty or risk-taking on the probabilistic
scores.

Thompson's (1983) study also involved the administration of items in
different response formats to college students. The study crossed two response
formats (categorical and probabilistic) with two item types (multiple-choice and
dichotomous) to obtain four different types of test items. These were (1) the
conventional multiple-choice item; (2) a probabilistic multiple-choice item,
similar to that used in Research Report 83-3; (3) a dichotomous (yes, no) item;
and (4) a dichotomous-probabilistic item in which an examinee answered by
stating, with a number between 0 and 100, his/her confidence that the answer to
the question was the correct answer. Similar to Research Report 83-3, Thompson
investigated several scoring systems for the probabilistic items. In addition,
the four test item types were evaluated in terms of the intercorrelations of the
scores they provided, their reliabilities, and their factor structures.

One other factor related to adaptive testing examined in this project
concerned the effects of test administration variables on ability test
performance and psychological reactions to testing. This study (Research Report
81-2) investigated the effects of two variables unique to computer
administration. One variable was immediate knowledge of results of the
correctness of each item response during the process of test administration.
The second variable, pacing of item presentation, was concerned with whether
the pace of the test administration was controlled by the examinee or by the
computer. The two variables were studied in both computer-administered
conventional and adaptive tests. The dependent variables included ability test
performance (maximum likelihood θ estimates and proportion correct), response
pattern information, item response latencies, and psychological reactions to
testing. Data were obtained from 477 college students who were randomly
assigned to the experimental conditions.

Major  Findings 


Adaptive  Achievement  Testing 

1. Adaptive intersubtest branching is a feasible approach to improving the
efficiency of test administration when a test battery is adaptively
administered. This approach can reduce test battery length by 50% or more with
no appreciable effect on the psychometric characteristics of scores on the
tests in the battery (Research Reports 79-6 and 81-4). Although the major
reductions in test battery length were attributable to adaptive intrasubtest
item selection, there were additional small reductions in test length
due to intersubtest branching. Intersubtest branching also resulted in
test battery information levels that closely approximated those of the full
test battery, in comparison to information levels obtained solely from the
use of adaptive intrasubtest item selection (Research Report 81-4). Results
also indicated (Research Report 81-4) that the order in which subtests
were selected for intersubtest branching had no effect on either the
efficiency of test administration or on the psychometric characteristics of
the resulting test scores.

2. The use of change scores to measure changes in achievement over time, which
assumes that the factor underlying changes in performance is invariant, may
be appropriate in some achievement testing environments and not in others.
Results from college courses (Research Report 81-5) indicated that the factor
structure of measured achievement in a biology course was not the same
prior to instruction as it was after several weeks of instruction. In a
mathematics course, however, the factor structure of measured achievement
did not change over a 10-week period. These results suggest that in the
absence of information to indicate that the dimensionality of measured
achievement does not change over time, it is inappropriate to compute simple
difference scores to measure changes in achievement levels.

3. There was some indication (Kingsbury, 1984) that Bayesian adaptive tests
using an individual prior achievement level estimate resulted in more reliable
change scores than were obtained from comparable conventional achievement
tests. Further research is needed, however, to investigate the
generalizability of these findings in other achievement domains.

4. Adaptive Mastery Testing (AMT) is a viable procedure for reducing test
length of mastery tests and improving the efficiency of mastery
classifications. In monte carlo simulation (Research Report 80-4) AMT achieved
the best combination of test length reduction and validity of mastery
classifications in comparison with a sequential probability ratio
classification procedure and conventional tests scored by proportion correct
and IRT-based Bayesian scoring. The advantages of AMT were most pronounced in
realistic item pools in which items varied in difficulties and discriminations.
AMT also tended to result in a more even balance of false mastery and false
non-mastery classifications in comparison to the sequential procedure.

Results from the simulation study were supported in live testing (Research
Report 81-3). In comparison to conventional achievement tests, both fixed-
and variable-length AMTs resulted in mastery classifications that were more
consistent with an independent mastery criterion. The average variable-length
adaptive test was able to make a high-confidence classification for
students using only 2 to 5 items, thus reducing test lengths by as much
as 74% to 88% from the 20-item conventional test, with no loss in
classification accuracy.

Adaptive  Ability  Testing 

5. Owen's Bayesian adaptive testing strategy results in θ estimates that,
under realistic testing conditions, are biased and not of equal precision
across θ levels (Research Report 83-2 and Weiss & McBride, 1984). Only
under the unrealistic situation in which true θ was used as the prior θ did
Owen's procedure result in unbiased θ estimates and reasonably horizontal
information functions. Bias was also differentially affected by item
discriminations for variable-length tests. In addition, for these tests, test
length was an increasing function of θ. The design of these studies
allowed the source of the bias to be identified as the use of a constant
(group) prior θ estimate to begin the Bayesian adaptive testing.

6. Errors in item parameter estimates do not seriously affect the performance
of adaptive testing strategies (Crichton, 1981; Mattson, 1983). In the
3-parameter data (Crichton, 1981) using indices combined across θ levels,
when error was introduced into the separate item parameter estimates the
effects were small for errors within the usually observed range. The a and
b parameters generally had similar effects on adaptive test performance,
while errors in the c parameter had negligible effects. When errors in the
three parameters were combined, effects differed little from the case with
error in a or b, except for very unrealistic levels of error. There were
no appreciable differences in susceptibility to error among the stradaptive,
maximum information, and Bayesian adaptive testing strategies.

7. When indices conditional on θ were examined in the 3-parameter data
(Crichton, 1981), Bayesian and maximum information adaptive tests were
somewhat less susceptible to errors in item parameter estimates than was
the stradaptive test. Whereas errors in estimation of the b and c parameters
had little effect on the conditional indices, estimation errors in the
a parameter resulted in the major effects on the conditional indices,
indicating that large errors in estimating a may deteriorate the performance of
the adaptive testing strategies. Even with this deterioration in performance,
however, the adaptive tests still performed better than the conventional
tests for a substantial portion of the θ range.

8. When a and b parameter estimates were allowed to correlate with each other,
as they do in many real item pools, there was no additional effect on θ
estimates beyond that due to errors in uncorrelated item parameter estimates
(Mattson, 1983).

9. Maximum likelihood θ estimation performed better than Bayesian estimation
for lesser degrees of error in the item parameters, whereas Bayesian estimation
was less affected by item parameter errors for more extreme levels of error,
particularly for the 1- and 2-parameter models (Mattson, 1983).

10. The 2-parameter model was least affected by errors in item parameter
estimates (Mattson, 1983). Under conditions of large errors in item parameter
estimates, the 2-parameter model performed better than the error-free cases
of 1- and 3-parameter models.

11. Multidimensionality has a more serious effect on θ estimates from maximum
information adaptive tests than do errors in item parameter estimates
(Suhadolnik & Weiss, 1983). For multidimensional structures with one or
two factors beyond the first that account for up to one-fourth the variance
of the first factor, overcoming the effects of multidimensionality would
require doubling of adaptive test length. The data also suggested that the
number of factors, and not simply the overall strength of the factor
structure, affects θ estimates, since a single factor beyond the first had less
effect than did two factors that accounted for the same amount of variance.
In general, however, adaptive testing is quite robust to multidimensional
structures of the type most frequently resulting from careful item selection,
i.e., factor structures with a strong first factor and second or
third factors that account for less than one-eighth of the variance of the
first factor.

12. Administration of multiple-choice items in a probabilistic response mode
may be a useful application of computerized test administration. Although
items answered in a probabilistic mode did not result in higher validities
than multiple-choice items responded to dichotomously, the probabilistic
mode resulted in higher reliabilities and a stronger first factor (Research
Report 83-3). Since the stronger first factor would result in higher IRT
item discrimination parameters for these items, adaptive testing based on
items administered probabilistically would likely be more efficient,
resulting in shorter tests or in more precise θ estimates. Additional
analyses of item formats and response modes (Thompson, 1983) showed that items
presented in a dichotomous format yielded different factor structures than
did multiple-choice formats, but supported the higher reliabilities
observed in Research Report 83-3 for the probabilistic response format.

13. Computerized test administration variables, including adaptive vs.
conventional test type, computer- vs. self-paced item administration, and
immediate knowledge of results after each item is administered, do not have
direct effects on ability test performance, as measured by estimated θ levels
(Research Report 81-2). These test administration variables do, however,
have effects on psychological reactions to testing. Immediate knowledge of
results appears to have a standardizing effect on test anxiety and
test-taking motivation, since mean levels of anxiety and motivation were
similar when knowledge of results was provided but different when it was not.
to  each  alternative  as  the  item  score.  Total  test  scores  for  all  of  the  scoring
methods  were  obtained  by  summing  individual  item  scores. 

Several  studies  using  probabilistic  response  methods  have  shown  the  effect  of  a 
response-style  variable,  called  certainty  or  risk  taking,  on  scores  obtained 
from  probabilistic  responses.  Results  from  this  study  showed  a  small  effect  of 
certainty on the probabilistic scores in terms of the validity of the scores but
no  effect  at  all  on  the  factor  structure  or  internal  consistency  of  the  scores. 
Once  the  effect  of  certainty  on  the  probabilistic  scores  had  been  ruled  out,  the 
five  scoring  formulas  were  compared  in  terms  of  validity,  reliability,  and  fac¬ 
tor  structure.  There  were  no  differences  in  the  validity  of  the  scores  from  the 
different  methods,  but  scores  obtained  from  the  two  scoring  formulas  that  were 
not reproducing scoring systems were more reliable and had stronger first
factors than the scores obtained using the reproducing scoring systems. For
practical use, however, the reproducing scoring systems may have an advantage
because they maximize examinees’ scores when examinees respond honestly, while
honest responses will not necessarily maximize an examinee’s score with the
other two methods. If a reproducing scoring system is used for this reason, the
spherical scoring formula is recommended, since it was the most internally
consistent and showed the strongest first factor of the reproducing scoring
systems.
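
The reproducing property described above can be illustrated with a short
sketch. The quadratic, spherical, and logarithmic formulas below are the
common textbook forms of these rules and are an assumption here; the report
does not state the exact formulas, truncation point, or scaling that were
used, and the fifth (absolute-difference) method is omitted because its
function is not specified. The belief vector and the 0.01 floor are
hypothetical values chosen for illustration.

```python
import math

def quadratic_score(p, correct):
    """Quadratic rule (assumed form): 2*p_c - sum of squared probabilities."""
    return 2.0 * p[correct] - sum(x * x for x in p)

def spherical_score(p, correct):
    """Spherical rule (assumed form): p_c divided by the vector length."""
    return p[correct] / math.sqrt(sum(x * x for x in p))

def truncated_log_score(p, correct, floor=0.01):
    """Log rule with p_c floored (hypothetical floor) so log(0) never occurs."""
    return math.log(max(p[correct], floor))

def linear_score(p, correct):
    """Probability assigned to the correct alternative (not reproducing)."""
    return p[correct]

def expected_score(rule, belief, report):
    """Expected score when `belief` holds the examinee's actual subjective
    probabilities and `report` is the allocation actually entered."""
    return sum(belief[c] * rule(report, c) for c in range(len(belief)))

# Hypothetical examinee: 100 points split 60/20/10/10, rescaled to sum to 1.
belief = [0.6, 0.2, 0.1, 0.1]
# Alternative strategy: put every point on the most likely alternative.
all_in = [1.0, 0.0, 0.0, 0.0]

for rule in (quadratic_score, spherical_score, truncated_log_score, linear_score):
    honest = expected_score(rule, belief, belief)
    gamble = expected_score(rule, belief, all_in)
    print(f"{rule.__name__:20s} honest={honest:.3f}  all-in={gamble:.3f}")
```

Under these assumed forms, the honest allocation has the higher expected
score for the three reproducing rules, whereas the probability-assigned rule
rewards placing all points on the single most likely alternative.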
achievement levels increased, the underlying factor structure remained unchanged.
The implications of these results for psychology, education, and program
evaluation are noted. (AD A110955)


Research  Report  83-2 

Bias  and  Information  of  Bayesian  Adaptive  Testing 
David  J.  Weiss  and  James  R.  McBride 
March  1983 

Monte carlo simulation was used to investigate score bias and information
characteristics of Owen’s Bayesian adaptive testing strategy, and to examine
possible causes of score bias. Factors investigated in three related studies
included effects of an accurate prior θ estimate, effects of item
discrimination, and effects of fixed vs. variable test length. Data were
generated from a three-parameter logistic model for 3,100 simulees in each of
eight data sets; Bayesian adaptive tests were administered, drawing items from
a "perfect" item pool. Results showed that the Bayesian adaptive test yielded
unbiased θ estimates and relatively flat information functions only in the
unrealistic situation in which an accurate prior θ estimate was used. When a
more realistic constant prior θ estimate was used with a fixed test length,
severe bias was observed that varied with item discrimination. A different
pattern of bias was observed with variable test length and a constant prior.
Information curves for the constant prior conditions generally became more
peaked and asymmetric with increasing item discrimination. In the variable
test length condition the test length required to achieve a specified level of
the posterior variance of θ estimates was an increasing function of θ level.
These results indicate that θ estimates from Owen’s Bayesian adaptive testing
method are affected by the prior θ estimate used and that the method does not
provide measurements that are unbiased and equiprecise except under the
unrealistic condition of an accurate prior θ estimate. (AD A129280)


Research  Report  83-3 

Effect  of  Examinee  Certainty  on  Probabilistic  Test  Scores 
and  a  Comparison  of  Scoring  Methods  for  Probabilistic  Responses 
Debra  Suhadolnik  and  David  J.  Weiss 
July  1983 

The  present  study  was  an  attempt  to  alleviate  some  of  the  difficulties  inherent 
in  multiple-choice  items  by  having  examinees  respond  to  multiple-choice  items  in 
a  probabilistic  manner.  Using  this  format,  examinees  are  able  to  respond  to 
each  alternative  and  to  provide  indications  of  any  partial  knowledge  they  may 
possess  concerning  the  item.  The  items  used  in  this  study  were  30  multiple- 
choice  analogy  items.  Examinees  were  asked  to  distribute  100  points  among  the 
four  alternatives  for  each  item  according  to  how  confident  they  were  that  each 
alternative  was  the  correct  answer.  Each  item  was  scored  using  five  different 
scoring  formulas.  Three  of  these  scoring  formulas — the  spherical,  quadratic, 
and  truncated  log  scoring  methods — were  reproducing  scoring  systems.  The  fourth 
scoring  method  used  the  probability  assigned  to  the  correct  alternative  as  the 
item score, and the fifth used a function of the absolute difference between the
correct  response  vector  for  the  four  alternatives  and  the  actual  points  assigned 


Research  Report  81-4

Factors  Influencing  the  Psychometric  Characteristics  of  an 
Adaptive  Testing  Strategy  for  Test  Batteries 
Vincent  A.  Maurelli  and  David  J.  Weiss 
November  1981 

A  monte  carlo  simulation  was  conducted  to  assess  the  effects  in  an  adaptive 
testing  strategy  for  test  batteries  of  varying  subtest  order,  subtest  termina¬ 
tion  criterion,  and  variable  versus  fixed  entry  on  the  psychometric  properties 
of an existent achievement test battery. Comparisons were made among
conventionally administered tests and adaptive tests using adaptive intra-subtest item
selection  with  and  without  inter-subtest  branching.  Data  consisted  of  responses 
of  300  simulees  to  a  201-item  achievement  test  battery.  Mean  test  battery 
length was reduced by 42.5% to 52.3% using adaptive intra-subtest item
selection with variable termination. Reductions in mean subtest lengths ranged from
27%  to  67%.  When  inter-subtest  branching  was  added,  additional  test  length  re¬ 
ductions  of  1%  to  2%  were  observed  for  individual  subtests.  The  reductions  in 
test  length  were  achieved  with  no  significant  loss  of  fidelity  or  psychometric 
information.  The  addition  of  inter-subtest  branching  resulted  in  levels  of  mean 
test  battery  information  more  similar  to  those  of  the  full  test  battery,  even 
with  mean  test  battery  reductions  of  50%  in  number  of  items  administered.  Sub¬ 
test  order  was  shown  to  have  no  effect  on  the  evaluative  criteria  employed.  The 
results  generally  supported  previous  studies  of  this  adaptive  testing  strategy. 
Suggestions  for  future  research  are  presented.  (AD  A109666) 


Research  Report  81-5 

Dimensionality  of  Measured  Achievement  Over  Time 
Kathleen A. Gialluca and David J. Weiss
December  1981 

Some  type  of  difference  or  change  score  is  frequently  used  to  quantify  the 
effects  of  experimental  treatments  and  educational  programs  on  individuals  and 
on  groups  of  individuals.  Whether  the  change  measurement  involves  the  use  of 
simple  difference  scores,  their  derivatives,  or  some  more  complex  methodological 
design,  the  measurement  process  itself  assumes  that  the  treatment  or  instruction 
results  in  higher  levels  of  the  originally  measured  variable  and  that  the  only 
change  that  occurs  is  a  quantitative  one.  If  this  assumption  is  not  met,  then 
the  computation  of  any  type  of  difference  score  is  inappropriate  and  the  scores 
themselves  are  useless  for  measuring  growth  or  change. 

Two  studies  investigated  the  tenability  of  the  assumption  that  classroom  in¬ 
struction  results  in  increases  in  students’  achievement  levels  while  the  quali¬ 
tative  nature  of  that  achievement  remains  constant  across  time.  The  data  util¬ 
ized  were  the  item  responses  to  tests  in  basic  mathematics  and  in  general  biolo¬ 
gy  administered  as  pretests  and  after  instruction  to  students  enrolled  in  those 
courses. 

Results indicated that this assumption was not tenable in the biology data set,
where increases in mean achievement level were accompanied by corresponding
changes  in  the  factor  structure  underlying  the  item  responses.  For  the  mathe¬ 
matics  data,  however,  there  was  no  such  violation  of  the  assumption:  As  student 
These results indicate that testing conditions may interact in a complex way to
determine psychological reactions to the testing environment. The interactions
do suggest, however, a somewhat consistent standardizing effect of KR on test
anxiety and test-taking motivation. This standardizing effect of KR showed that
approximately equal levels of motivation and anxiety were reported under the
various testing conditions when KR was provided, but that mean levels of these
variables were substantially different when KR was not provided. Consistent
with  theoretical  expectations,  the  conventional  test  was  perceived  as  being 
either  too  easy  or  too  difficult,  whereas  the  adaptive  tests  were  perceived  more 
often  as  being  of  appropriate  difficulty. 

The  results  concerning  the  effects  of  KR  on  test  performance,  motivation,  and 
anxiety  found  in  this  study  were  contrary  to  earlier  reported  findings;  and  dif¬ 
ferences  in  the  studies  are  delineated.  Recommendations  are  made  concerning  the 
control  of  specific  testing  conditions,  such  as  difficulty  of  the  test  and  abil¬ 
ity  level  of  the  examinee  population,  as  well  as  suggestions  for  the  further 
analysis  of  the  standardizing  effect  of  KR.  (AD  A097688) 


Research  Report  81-3 

A  Validity  Comparison  of  Adaptive  and  Conventional 
Strategies  for  Mastery  Testing 
G. Gage Kingsbury and David J. Weiss
September  1981 

Conventional  mastery  tests  designed  to  make  optimal  mastery  classifications  were 
compared  with  fixed-length  and  variable-length  adaptive  mastery  tests  in  terms 
of validity of decisions with respect to an external criterion measure.
Comparisons between the testing procedures were made across five content areas
in an introductory biology course from tests administered to over 400
volunteer students. The criterion measure used was the student's final
standing in the course, based on course examinations and laboratory grades.
Results indicated that the adaptive test resulted in mastery classifications
that were more consistent with final class standing than those obtained from
the conventional test. This result was observed within individual content
areas and for discriminant analysis classifications made across content areas.
This result was also observed for two scoring procedures used with the
conventional test (proportion-correct and Bayesian scoring). Results also
indicated that there was no decrement in the performance of the adaptive test
when a variable termination rule was implemented. This variable termination
rule resulted in test lengths which were, on the average, 74% to 88% shorter
than the original adaptive tests. Further analyses explicated the manner in
which the adaptive tests administered
differed  from  the  conventional  test  for  each  content  area  as  a  function  of 
achievement  level.  This  evidence  was  used  to  explain  why  the  adaptive  tests 
resulted  in  more  valid  decisions  than  the  conventional  procedure,  in  spite  of 
the  fact  that  the  type  of  conventional  test  used  here  was  the  most  informative 
test concerning the mastery cutoff. It is concluded that variable-length
adaptive mastery tests can provide more valid mastery classifications than
"optimal" conventional mastery tests while reducing test length an average of
80% from the length of the conventional tests. (AD A106867)


plied  to  problems  of  item  option  weighting  and  adaptive  testing.  Important  de¬ 
velopments  with  these  models  during  the  period  included  the  demonstration  of 
their  relationship  with  other  psychological  measurement  models,  and  methods  for 
determining  fit  of  individuals  to  IRT  models.  As  another  alternative  to  classi¬ 
cal  test  theory,  order  models  were  developed  and  studied,  and  several  other  mod¬ 
els  were  proposed. 

Validity  issues  were  also  studied  during  this  period.  A  number  of  approaches  to 
the analysis of multitrait-multimethod matrices were proposed and compared,
including some based on structural equations models. Issues of predictive
validity studied included necessary sample sizes, validity generalization, and
moderator and suppressor effects. Test fairness issues and their effects on validity
received  considerable  attention.  Concern  was  with  (1)  bias  in  selection;  (2) 
fairness  to  minorities,  including  differential  and  single-groups  validity  and 
comparisons  of  regression  lines,  adverse  impact,  and  bias  in  test  content;  and 
(3)  fairness  to  women. 

It  is  concluded  that  little  of  consequence  was  accomplished  in  classical  test 
theory during this period. The most important developments were in alternatives
to  classical  test  theory,  primarily  item  response  theory.  Research  in  this  area 
resulted  in  data  and  other  developments  that  will  permit  a  better  understanding 
of  the  range  of  applicability  of  these  models  and  their  potential  for  solving 
measurement  problems  not  solvable  by  classical  models.  (AD  A096157) 


Research  Report  81-2 

Effects  of  Immediate  Feedback  and  Pacing  of  Item  Presentation 
on  Ability  Test  Performance  and  Psychological  Reactions  to  Testing 
Marilyn F. Johnson, David J. Weiss, and J. Stephen Prestwood
February 1981

The study investigated the joint effects of knowledge of results (KR or no-KR),
pacing  of  item  presentation  (computer  or  self-pacing),  and  type  of  testing 
strategy  (50-item  peaked  conventional,  variable-length  stradaptive,  or  50-item 
fixed-length  stradaptive  test)  on  ability  test  performance,  test  item  response 
latency,  information,  and  psychological  reactions  to  testing.  The  psychological 
reactions  to  testing  were  obtained  from  Likert-type  items  that  assessed  test¬ 
taking  anxiety,  motivation,  perception  of  difficulty,  and  reactions  to  knowledge 
of  results.  Data  were  obtained  from  447  college  students  randomly  assigned  to 
one  of  the  12  experimental  conditions. 

The  results  indicated  that  there  were  no  effects  on  ability  estimates  due  to 
knowledge  of  results,  testing  strategy,  or  pacing  of  item  presentation.  Al¬ 
though  average  latencies  were  greater  on  the  stradaptive  tests  than  on  the  con¬ 
ventional  test,  the  overall  testing  time  was  not  substantially  longer  on  the 
adaptive  tests  and  may  have  been  a  function  of  differences  in  test  difficulty. 
Analysis of information values indicated higher levels of information on the
stradaptive tests than on the conventional test. There was no statistically
significant main effect for any of the three experimental conditions when test
anxiety or test-taking motivation was the dependent variable, although there
were some significant interaction effects.


The  concurrent  validity  analysis  showed  that  the  conventional  test  produced 
ability  level  estimates  that  correlated  more  highly  with  the  criterion  test 
scores  than  did  the  Bayesian  test  for  all  lengths  greater  than  four  items.  This 
result  was  observed  for  both  scoring  procedures  used  with  the  conventional  test. 

Limitations  of  the  study,  and  the  conclusions  that  may  be  drawn  from  it,  are 
discussed.  These  limitations,  which  may  have  affected  the  results  of  this 
study, included possible differences in the alternate forms used within the two
testing  strategies,  the  relatively  small  calibration  samples  used  to  estimate 
the  ICC  parameters  for  the  items  used  in  the  study,  and  method  variance  in  the 
conventional  tests.  (AD  A094477) 


Research  Report  81-1

Review  of  Test  Theory  and  Methods
David  J.  Weiss  and  Mark  L.  Davison 
January  1981 

The  research  literature  on  test  theory  and  methods  for  the  period  1975  through 
early  1980  is  critically  reviewed.  Research  on  classical  test  theory  has  con¬ 
centrated  on  relatively  unimportant  developments  in  reliability  theory,  with 
some new developments and applications of generalizability theory appearing
during this period. The reliability of change or gain scores has received some
attention  from  the  classical  test  theory  perspective,  as  have  the  applications 
of  classical  reliability  concepts  in  experimental  design  and  the  analysis  of 
experimental  data.  A  minor  amount  of  research  with  classical  models  was  in  the 
area  of  test-score  equating.  Classical  item  analysis  procedures,  however,  re¬ 
ceived  little  attention.  A  fair  amount  of  research  during  the  period  was  devot¬ 
ed  to  different  item  types  and  test  item  response  modes  as  replacements  for  the 
ubiquitous  multiple-choice  item.  Several  types  of  true-false  items  were  pro¬ 
posed,  and  formula  scoring  was  studied  by  a  number  of  researchers  in  an  attempt 
to  reduce  guessing  effects.  The  perennial  topic  of  response  option  weighting 
received attention, with efforts oriented toward demonstrating effects on
validity and reliability. Response modes studied included answer-until-correct,
confidence weighting, and free-response.

A  number  of  alternatives  to  classical  test  theory  were  studied  in  an  attempt  to 
solve  some  of  the  problems  for  which  classical  test  theory  has  proven  to  be 
inadequate.  Research  on  criterion-referenced  testing  continued  during  this  pe¬ 
riod.  Latent  trait  test  theory  (item  response  theory,  or  IRT)  received  consid¬ 
erable  attention.  Research  on  the  1-parameter  IRT  model  continued  to  address 
problems of parameter estimation, model fit, and equating. The question of the
person-free and sample-free characteristics of this model (i.e., its robustness)
was investigated, with results generally supporting these desirable
characteristics. In addition, a special case of this model that can account for guessing
was  developed,  and  the  model  was  generalized  and  successfully  applied  to  poly- 
chotomous  attitude  types  of  items.  Considerable  research  occurred  on  the  2-  and 
3-parameter IRT models. The concept of information as a replacement for
classical reliability concepts was studied, and its uses in developing parallel
tests were described. As with the 1-parameter IRT model, problems of parameter
estimation and equating were investigated. These IRT models were successfully ap-


Research  Report  80-4

A Comparison of Adaptive, Sequential, and Conventional Testing
Strategies  for  Mastery  Decisions 
G.  Gage  Kingsbury  and  David  J.  Weiss 
November  1980 

Two  procedures  for  making  mastery  decisions  with  variable  length  tests  and  a 
conventional  mastery  testing  procedure  were  compared  in  monte  carlo  simulation. 
The  simulation  varied  the  characteristics  of  the  item  pool  used  for  testing  and 
the  maximum  test  length  allowed.  The  procedures  were  compared  in  terms  of  the 
mean  test  length  needed  to  make  a  decision,  the  validity  of  the  decisions  made 
by  each  procedure,  and  the  types  of  classification  errors  made  by  each  proce¬ 
dure.  Both  of  the  variable  test  length  procedures  were  found  to  result  in  im¬ 
portant  reductions  in  mean  test  length  from  the  conventional  test  length.  The 
Sequential  Probability  Ratio  Test  (SPRT)  procedure  resulted  in  greater  test 
length reductions, on the average, than the Adaptive Mastery Testing (AMT)
procedure. However, the AMT procedure resulted both in more valid mastery
decisions and in more balanced error rates than the SPRT procedure under all
conditions. In addition, the AMT procedure produced the best combination of test
length  and  validity.  (AD  A094478) 
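
As background for the SPRT procedure named in this abstract, the following
is a minimal sketch of Wald's sequential probability ratio test applied to a
mastery decision. It assumes a simple binomial model in which every item has
the same probability of a correct response, and the two proportion-correct
values and the error rates are hypothetical; the report's own item pools,
parameter values, and rule at maximum test length are not reproduced here.

```python
import math

def sprt_mastery(responses, p_nonmaster=0.6, p_master=0.8, alpha=0.05, beta=0.05):
    """Classify an examinee from a sequence of 0/1 item scores.

    p_nonmaster, p_master: assumed proportion correct for the two hypotheses.
    alpha, beta: nominal false-master and false-nonmaster error rates.
    Returns (decision, number of items administered)."""
    upper = math.log((1.0 - beta) / alpha)   # cross this: decide "master"
    lower = math.log(beta / (1.0 - alpha))   # cross this: decide "nonmaster"
    llr = 0.0                                # running log-likelihood ratio
    for n, correct in enumerate(responses, start=1):
        if correct:
            llr += math.log(p_master / p_nonmaster)
        else:
            llr += math.log((1.0 - p_master) / (1.0 - p_nonmaster))
        if llr >= upper:
            return "master", n
        if llr <= lower:
            return "nonmaster", n
    # Maximum test length reached without crossing either boundary.
    return "undecided", len(responses)

print(sprt_mastery([1] * 20))       # ('master', 11)
print(sprt_mastery([0] * 20))       # ('nonmaster', 5)
print(sprt_mastery([1, 0, 1, 0]))   # ('undecided', 4)
```

In an operational test a decision would still have to be forced when the
maximum test length is reached; the "undecided" return value simply marks
that case in this sketch.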


Research  Report  80-5 

An  Alternate-Forms  Reliability  and  Concurrent  Validity 
Comparison  of  Bayesian  Adaptive  and  Conventional  Ability  Tests 
G.  Gage  Kingsbury  and  David  J.  Weiss 
December  1980 

Two  30-item  alternate  forms  of  a  conventional  test  and  a  Bayesian  adaptive  test 
were  administered  by  computer  to  472  undergraduate  psychology  students.  In 
addition,  each  student  completed  a  120-item  paper-and-pencil  test,  which  served 
as  a  concurrent  validity  criterion  test,  and  a  series  of  very  easy  questions 
designed  to  detect  students  who  were  not  answering  conscientiously.  All  test 
items  were  five-alternative  multiple-choice  vocabulary  items.  Reliability  and 
concurrent  validity  of  the  two  testing  strategies  were  evaluated  after  the  ad¬ 
ministration  of  each  item  for  each  of  the  tests,  so  that  trends  indicating  dif¬ 
ferences  in  the  testing  strategies  as  a  function  of  test  length  could  be  detect¬ 
ed.  For  each  test,  additional  analyses  were  conducted  to  determine  whether  the 
two  forms  of  the  test  were  operationally  alternate  forms. 

Results of the analysis of alternate-forms correspondence indicated that for all
test  lengths  greater  than  10  items,  each  of  the  alternate  forms  for  the  two  test 
types  resulted  in  fairly  constant  mean  ability  level  estimates.  When  the  scor¬ 
ing  procedure  was  equated,  the  mean  ability  levels  estimated  from  the  two  forms 
of  the  conventional  test  differed  to  a  greater  extent  than  those  estimated  from 
the  two  forms  of  the  Bayesian  adaptive  test. 

The  alternate-forms  reliability  analysis  indicated  that  the  two  forms  of  the 
Bayesian test resulted in more reliable scores than the two forms of the
conventional test for all test lengths greater than two items. This result was
observed when the conventional test was scored either by the Bayesian or
proportion-correct method.


Research  Report  79-6

Efficiency  of  an  Adaptive  Inter-Subtest  Branching  Strategy 
in  the  Measurement  of  Classroom  Achievement 
Kathleen A. Gialluca and David J. Weiss
November  1979 

A  real-data  simulation  was  conducted  to  investigate  the  efficiency  of  an  adap¬ 
tive  testing  strategy  designed  for  achievement  test  batteries  applied  to  a 
classroom  achievement  test.  This  testing  strategy  combined  adaptive  item  selec¬ 
tion  routines  both  within  and  between  the  subtests  of  the  test  battery.  Compar¬ 
isons  were  made  between  the  conventionally-administered  tests  and  the  simulated 
adaptive  tests  in  terms  of  test  length,  psychometric  information,  and  correla¬ 
tions  of  achievement  estimates.  Design  of  the  study  also  permitted  (1)  separa¬ 
tion  of  the  effects  of  the  adaptive  intra-subtest  item  selection  procedure  and 
inter-subtest  branching,  (2)  evaluation  of  the  effects  of  different  intra-sub¬ 
test  termination  criteria,  (3)  use  of  classical  regression  equations  and  regres¬ 
sion  equations  corrected  for  errors  of  measurement  in  the  predictors,  and  (4) 
cross-validation  stability  of  the  inter-subtest  branching  regression  predic¬ 
tions.  Data  consisted  of  the  responses  from  1,600  students  to  classroom-admin¬ 
istered  final  exams  in  a  general  biology  course  at  the  University  of  Minnesota. 

Total test length was reduced by 16% to 30% using the adaptive intra-subtest
item  selection  strategy  with  a  variable  termination  criterion  that  omits  those 
items  providing  little  information  to  the  measurement  process.  Subtest-length 
reductions  ranged  from  about  8%  to  62%.  Total  test  length  was  reduced  another 
1%  to  5%  (with  subtest-length  reductions  of  up  to  53%)  upon  the  addition  of  an 
inter-subtest  branching  strategy  that  utilized  regression  equations  with  prior 
information concerning a student's performance.
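
The idea of using a regression equation to choose where the next subtest
begins can be sketched as follows. This is a simplified illustration with a
single predictor, not the procedure used in the study, which is described as
using multiple correlations. The optional division of the slope by the
predictor's reliability is the classical single-predictor correction for
errors of measurement and is an assumption about what "corrected" means
here. All names and data values are hypothetical.

```python
def fit_branching_equation(x, y, reliability_x=1.0):
    """Least-squares line predicting achievement on the next subtest (y)
    from an estimate on a subtest already administered (x).

    reliability_x < 1 divides the slope by the predictor's reliability,
    the classical correction for measurement error in a single predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = (sxy / sxx) / reliability_x
    intercept = mean_y - slope * mean_x
    return intercept, slope

def entry_level(prior_estimate, intercept, slope):
    """Predicted starting achievement level for the next subtest."""
    return intercept + slope * prior_estimate

# Hypothetical calibration data: estimates on subtest 1 and subtest 2.
subtest1 = [-1.0, 0.0, 1.0, 2.0]
subtest2 = [-0.5, 0.0, 0.5, 1.0]

intercept, slope = fit_branching_equation(subtest1, subtest2)
print(entry_level(1.0, intercept, slope))   # 0.5
```

The predicted value would then serve as the starting point (or prior) for
item selection within the next subtest, so that fewer items are spent
locating the student's approximate level.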

Reductions  in  subtest  length  were  accomplished  with  virtually  no  loss  in  psycho¬ 
metric  information.  Correlations  between  the  Bayesian  achievement  estimates 
from the adaptive and conventional tests were uniformly high, typically r = .90
and  higher.  Results  showed  that  the  use  of  the  corrected  regression  equations 
did  little  to  improve  the  performance  of  the  inter-subtest  branching;  although 
the  multiple  correlations  for  the  corrected  equations  were  higher,  both  the  in¬ 
formation  curves  and  correlations  of  achievement  estimates  were  generally  lower. 
Cross-validation  results  indicated  that  the  procedure  can  be  used  in  different 
samples  from  the  same  population. 

Results  from  this  study  generally  supported  the  generality  of  this  adaptive 
testing  strategy  for  reducing  achievement  test  length  with  no  adverse  impact  on 
the  quality  of  the  measurements.  Suggestions  are  made  for  further  research  with 
this  testing  strategy.  (AD  A080956) 


Perceptions  of  test  difficulty  were  different  for  adaptive  and  conventional 
tests;  students  accurately  perceived  the  conventional  tests  as  either  too 
easy  or  too  difficult,  depending  on  their  ability  levels,  while  the  adap¬ 
tive  test  was  generally  accurately  perceived  as  being  of  appropriate  diffi¬ 
culty. 